Project Details

In the past few years, the housing market in Kansas City has really taken off, so I thought it would be great to look at the time series of the average home selling price over the past ten years and use it to forecast where home prices are heading. Beyond the average home selling price alone, I also wanted to bring in some other time series variables to help aid the forecasting and get better insight into what affects home selling prices. I decided that every forecast for this time series will go out 36 months, or 3 years, to show the full effect of where the forecast is heading.

Preparation

Before I start the analysis, I need to get the R environment ready by loading the packages, functions, and data I will need.

Loading Packages

First I load the packages I will be using throughout the project. I also set how numbers will be displayed: the output had a tendency to switch into scientific notation, so for consistency I wanted to make sure all numbers were printed in a similar format.

require(fma)
require(ggplot2)
require(ggthemes)
require(magrittr)
require(zoo)
require(fBasics)
require(fpp)
require(forecast)
require(knitr)
require(lmtest)
require(corrplot)
require(mvtsplot)
options("scipen"=1, "digits"=4)

Loading Function

The next step is to load a custom function that I have used in many different assignments throughout the year. It makes it easy to arrange multiple plots together, which is extremely useful when comparing charts, since it is best when the plots can be placed right next to each other for reference.

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Loading Data

And lastly, before I can start the analysis, I need to load the dataset that the analysis will be performed on. It is a custom dataset with monthly information for the past 10 years on the average home selling price in Kansas City, the main variable I will be looking at throughout the project, along with the number of home listings in Kansas City, the number of new housing permits in Kansas City, and the Housing Price Index for the United States. After loading the dataset, I immediately turn the first variable, the average home selling price in Kansas City, into a time series so that I can begin the analysis.

kchouse <- read.csv("kc_housing.csv")
kcts <- ts(kchouse$avg_price, start=c(2007,8),frequency=12)
#kcts

Single Variable Time Series

The analysis begins with the single-variable time series, for which the time series object has already been created.

Plots of Time Series

The first analysis done on the time series of the average selling price of homes in Kansas City will be a visual one.

Time Series Plot

A simple time series plot will help to aid in understanding the dataset.

kcts %>% autoplot()+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
  labs(title = "Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

A lot can be learned from this first plot. The first thing noticed is that there is clear seasonality to the time series, which appears to be yearly. There is also a positive overall trend, but there was a definite downturn between the start of the data in August of 2007 and a bottom in January of 2009, which matches the timeframe of the last financial crisis. What is very interesting is that the downturn in the housing market seems to have happened before the downturn in the stock market.

For a few years after the crisis, home prices stayed fairly stable: after the 2008 drop, they remained relatively flat for the next 3 years before starting their 7-plus-year rise in average home prices.

Also, in looking at the chart, the dataset does not seem to be stationary. A simple tell is that a stationary series fluctuates around a constant mean, which this dataset clearly does not do. There are other visual and statistical tests that can be run to confirm this.
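A minimal sketch of such a statistical check, using adf.test() and kpss.test() from the tseries package (pulled in as a dependency of fpp), with a simulated random walk standing in for the actual price series:

```r
# Formal stationarity checks (sketch): a simulated random walk stands in
# for the price series, since it is clearly non-stationary by construction.
library(tseries)
set.seed(42)
rw <- ts(cumsum(rnorm(120, mean = 100, sd = 10)), frequency = 12)

adf.test(rw)   # ADF null hypothesis: a unit root is present (non-stationary)
kpss.test(rw)  # KPSS null hypothesis: the series is level-stationary
```

Note that the two tests have opposite null hypotheses, so for a non-stationary series we expect a large ADF p-value and a small KPSS p-value.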

Decomposition Plots

The next visual analysis will be a decomposition analysis.

autoplot(stl(kcts, s.window="periodic", robust=TRUE))+
  theme_minimal()+
  labs(title = "Decomposition of Average Kansas City Home Selling Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

#decompose(kcts)%>%autoplot()+theme_minimal()+scale_colour_calc()

The decomposition plot shows the trend, the yearly seasonal effect, and the remainder of the time series. It confirms that there is indeed yearly seasonality in the dataset. But it also shows that the trend is not what it originally appeared: in the first plot it seemed like home prices stabilized after 2008, but once the seasonality is removed it is clear that prices did not stabilize, they actually continued to drop until the first half of 2011. The plot also shows that there is a remainder in the dataset that is not explained by the trend and seasonality alone.

These remainders are going to be a bit of an issue when trying to forecast from the single time series alone, but hopefully, when the other variables are brought in later, they will help to explain these remainders in the data.

ACF and PACF Plots

The next visual analysis looks at the ACF and PACF of the dataset.

p1<-ggAcf(kcts) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF of Average Kansas City Home Selling Price')

p2<-ggPacf(kcts) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')

multiplot(p1,p2, cols =1)

The ACF plot confirms that the data is highly correlated with itself, showing significant autocorrelation through at least 24 lags. The gentle slope of the ACF tells us that the data is non-stationary, and the spikes of correlation at lags 12 and 24 help confirm the yearly seasonality effect. The plot also shows that the data is not just white noise, though that is easier to judge once the series has been made stationary.

The PACF plot shows that once the lag effects have been removed there is still correlation in the dataset, meaning the data has partial autocorrelation. The spikes at lags 1 and 13 also point to yearly seasonality.

Histogram

Next we look at the histogram and density plot of the dataset.

ggplot(kcts, aes(x=y)) +
  geom_histogram(aes(y=..density..),binwidth=5000,  fill = "goldenrod")+
  geom_density(aes(y=..density.., colour ="Navy")) +  
  geom_vline(aes(xintercept=mean(y, na.rm=TRUE)),color="grey69", linetype="dashed", size=1)+
  labs(title = "Histogram of Average Kansas City Home Selling Price", x = "Price", y = "Density") + 
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

This histogram and density plot still provide a lot of information about the dataset: it is not normally distributed, it is widely dispersed, it is skewed to the left, and its mean is not 0.

QQ-Plot

The last visual analysis that will be done is the QQ-Plot.

y     <- quantile(kcts, c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x     <- qnorm( c(0.25, 0.75))         # Find the matching normal values on the x-axis
slope <- diff(y) / diff(x)             # Compute the line slope
int   <- y[1] - slope * x[1]  

ggplot() + 
  aes(sample=kcts) + 
  stat_qq(distribution=qnorm) +  
  geom_abline(intercept=int, slope=slope, colour = 'navy') + 
  ylab("Sample") +
  xlab("Theoretical")+
  ggtitle("QQ-Plot of Average Kansas City Home Selling Price") + 
  scale_x_continuous()+
  scale_y_continuous() + 
  theme_minimal() +
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

The QQ-plot shows exactly what was seen above in the histogram and density plot: the data is not normally distributed, skews to the left, and is flat.

Statistical Tests of Time Series

Now that visual analysis of the dataset has been done, statistical tests will be performed as well.

Test for Normality

The first test that will be done is a test for normality, which, from the visual analysis, we expect to show that the data is not normally distributed.

normalTest(kcts,method=c("jb")) 
## 
## Title:
##  Jarque - Bera Normalality Test
## 
## Test Results:
##   STATISTIC:
##     X-squared: 6.5054
##   P VALUE:
##     Asymptotic p Value: 0.03867 
## 
## Description:
##  Sat Feb 26 10:52:27 2022 by user:

The Jarque-Bera test takes the data being normally distributed as the null hypothesis and the data not being normally distributed as the alternative hypothesis. With a p-value of .03867, under .05, we reject the null hypothesis and conclude that the dataset is not normally distributed.

Test for Autocorrelation

The next statistical test will be for autocorrelation in the first 24 lags of the data. The visual analysis showed that there was autocorrelation in the dataset.

Box.test(kcts, lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  kcts
## X-squared = 1100, df = 24, p-value <2e-16

With the Box-Ljung test, the null hypothesis is that the data has no autocorrelation and the alternative hypothesis is that it does. With the p-value far below .05, at under 2e-16, we reject the null hypothesis and conclude that the data is autocorrelated.

Plots of Differenced Time Series

Now that it has been determined that the dataset is not stationary, one of the easiest ways to address this is by taking the first difference of the dataset. Let's do a visual analysis of the differenced average home selling price of Kansas City to see what can be found.
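The number of first differences needed can also be estimated programmatically; a sketch using ndiffs() from the forecast package (already loaded above), shown on a simulated random walk rather than the actual price series:

```r
# Estimate how many first differences are needed for stationarity.
# ndiffs() applies a unit-root test (KPSS by default) repeatedly.
library(forecast)
set.seed(1)
rw <- ts(cumsum(rnorm(120)), frequency = 12)  # simulated non-stationary series

ndiffs(rw)        # typically 1 for a random walk
ndiffs(diff(rw))  # typically 0 once the series has been differenced
```

This is a quick cross-check on the choice of a single difference before moving on to the differenced plots.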

Time Series Plot

Just like with the first analysis, this analysis will begin with a simple time series plot.

kcts %>%diff%>% autoplot()+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
  labs(title = "Difference of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

The differenced plot of the average home selling price in Kansas City looks completely different from the first plot; it shows the change in home price from month to month. From 2007 to 2009 the drops seem greater than the gains, and from 2013 on the gains seem greater than the drops. It is harder to make out exactly what is taking place with the seasonality still present, but the dataset now appears stationary, because the mean looks much closer to 0.

Decomposition Plots

The next analysis decomposes the differenced dataset.

autoplot(stl(diff(kcts), s.window="periodic", robust=TRUE))+theme_minimal()+
  labs(title = "Decomposition of Difference of Average Kansas City Home Selling Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

The decomposition of the differenced dataset shows that there is still a yearly seasonal effect. The trend is negative from pretty much the start of the data until mid-2011, then positive from mid-2011 to the end of the dataset, which matches what we saw with the regular average selling price data: a decline until mid-2011, then an increase. There is still a good amount of remainder that the decomposition has not accounted for, but that is to be expected.

ACF and PACF Plots

Now the ACF and PACF of the differenced dataset.

p1 <- ggAcf(diff(kcts)) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF Plots of Differenced Average Kansas City Home Selling Price')

p2<-ggPacf(diff(kcts)) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')

multiplot(p1,p2, cols =1)

The ACF of the differenced data, judging by how quickly it decays to 0, leads me to believe that the data is now stationary, which is what we were hoping for. The spikes at lags 12 and 24 again point to the yearly seasonality. And the autocorrelation that remains means the differenced data is not just white noise, so the time series itself can be used to help predict the forecast.

The PACF plot shows that, once the lag effects are removed from the differenced dataset, there is still autocorrelation present, meaning the data has partial autocorrelation, which can be used in an autoregressive model. The spike at lag 12 shows there is yearly seasonality as well, which will have to be accounted for before these plots can be used to help determine an ARIMA model.

Histogram

Now to examine the new histogram and density plot.

ggplot(diff(kcts), aes(x=y)) +
  geom_histogram(aes(y=..density..),binwidth=5000,  fill = "goldenrod")+
  geom_density(aes(y=..density.., colour ="Navy")) +  
  geom_vline(aes(xintercept=mean(y, na.rm=TRUE)),color="grey69", linetype="dashed", size=1)+
  labs(title = "Histogram of Difference of Average Kansas City Home Selling Price", x = "Price", y = "Density") + 
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

This histogram and density plot is exactly what we want to see from a normally distributed dataset: the distribution has the classic bell shape, both sides appear evenly distributed, and the mean falls just about at 0, which is what is needed for a normal distribution.

QQ-Plot

And a QQ-plot will be used for the last visual analysis of the differenced dataset.

y     <- quantile(diff(kcts), c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x     <- qnorm( c(0.25, 0.75))         # Find the matching normal values on the x-axis
slope <- diff(y) / diff(x)             # Compute the line slope
int   <- y[1] - slope * x[1]  

ggplot() + 
  aes(sample=diff(kcts)) + 
  stat_qq(distribution=qnorm) +  
  geom_abline(intercept=int, slope=slope, colour = 'navy') + 
  ylab("Sample") +
  xlab("Theoretical")+
  ggtitle("QQ-Plot of Difference of Average Kansas City Home Selling Price") + 
  scale_x_continuous()+
  scale_y_continuous() + 
  theme_minimal() +
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

With almost all of the points of the QQ-plot landing on or close to the reference line, and no pronounced shape to the deviations, this graph also points to the differenced dataset now being normally distributed.

Statistical Tests of Differenced Time Series

With the visual analysis of the differenced dataset complete, statistical analysis will now be performed.

Test for Normality

Again, the first test will be a test for normality, which, from the visual analysis, we expect to show that the differenced data is normally distributed.

normalTest(diff(kcts),method=c("jb")) 
## 
## Title:
##  Jarque - Bera Normalality Test
## 
## Test Results:
##   STATISTIC:
##     X-squared: 1.2211
##   P VALUE:
##     Asymptotic p Value: 0.5431 
## 
## Description:
##  Sat Feb 26 10:52:30 2022 by user:

With the p-value of the test well above .05, at .5431, we fail to reject the null hypothesis and conclude that the differenced data is normally distributed, unlike the original dataset.

Test of Autocorrelation

Next is the test for autocorrelation in the first 24 lags of the differenced data. The visual analysis showed that there was still autocorrelation in the differenced dataset.

Box.test(diff(kcts), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  diff(kcts)
## X-squared = 130, df = 24, p-value <2e-16

Using the Box-Ljung test, with the resulting p-value far below .05, at under 2e-16, we reject the null hypothesis and conclude that the differenced data is still autocorrelated.

Plots of Seasonality

Looking at and testing both the original and differenced data, we can tell that there is seasonality in the data. And since the data is monthly and the significant lags always seem to be separated by 12 lags, it can be concluded that the seasonality is yearly. To get a better look at this seasonality, I created some seasonal plots to examine the information more closely.

Seasonal Plots

The first plot I created is a seasonal plot, which plots the data against the seasonal period, in our case the months of the year, with one line per year so the years can be compared directly.

ggseasonplot(kcts) + labs(title="Seasonal Plot of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017", y = "Price")+theme_minimal()+
theme(legend.position="right",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Not only does this result in a visually striking graph, but a lot can be learned from it. Looking at the year-over-year information, the month with the lowest selling price appears to be February and the month with the highest appears to be July, which means the best time to sell your house is in the summer and the best time to buy is in the winter, according to this graph.

Not surprisingly, the year with the highest average home selling price is 2007, and the year with the lowest is 2010, which was discovered earlier in the decomposition of the time series data. This helps show how close home prices stayed between 2008 and 2011, until they began to rise in the second half of 2011.

Boxplots

And the other seasonal plot we will look at is the boxplot of the seasonal effect.

ggplot() + 
  geom_boxplot(aes(x = factor(cycle(kcts)), y = kcts, fill =factor(cycle(kcts)))) + 
  scale_x_discrete(labels=c("1"="Jan","2"="Feb","3"="Mar","4"= "Apr","5"="May", "6"="Jun","7"="Jul","8"="Aug","9"="Sep","10"="Oct","11"="Nov","12"="Dec"))+
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "Seasonal Boxplot of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Month", y = "Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

This confirms some of the things discovered in the last plot: February looks like the month with the lowest average selling price, with June traditionally the highest. What is interesting is the amount of variation the graph shows in average selling price. The variation is similar from January through May, shrinks in June, is noticeably lower in August, September, and October, and then begins to widen again as the year comes to a close.

Holt-Winters

Now that both visual and statistical testing has been completed on the dataset, I will begin creating and selecting models for forecasting, starting with the Holt-Winters model. I chose to start with Holt-Winters instead of just Holt's method because of the seasonality in my data. For this model and the ETS model, I will use the regular time series data.

Models

I will start by creating four different Holt-Winters models: additive, additive damped, multiplicative, and multiplicative damped.

fit1 <- hw(kcts, seasonal="additive", h=36) 
fit2 <- hw(kcts, seasonal="additive", damped = TRUE, h=36)
fit3 <- hw(kcts,seasonal="multiplicative",h=36) 
fit4 <- hw(kcts,seasonal="multiplicative", damped=TRUE, h=36) 

fit1$model
## Holt-Winters' additive method 
## 
## Call:
##  hw(y = kcts, h = 36, seasonal = "additive") 
## 
##   Smoothing parameters:
##     alpha = 0.4028 
##     beta  = 0.0354 
##     gamma = 0.0001 
## 
##   Initial states:
##     l = 180374.9332 
##     b = -489.709 
##     s = 11060 15997 6922 -2288 -6449 -14442
##            -12750 -2040 -1221 -937.5 -122.4 6272
## 
##   sigma:  4572
## 
##  AIC AICc  BIC 
## 2614 2620 2661
fit2$model
## Damped Holt-Winters' additive method 
## 
## Call:
##  hw(y = kcts, h = 36, seasonal = "additive", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.285 
##     beta  = 0.0982 
##     gamma = 0.0001 
##     phi   = 0.9003 
## 
##   Initial states:
##     l = 180374.8839 
##     b = -991.5613 
##     s = 11060 15997 6922 -2288 -6449 -14418
##            -12715 -1759 -1151 -937.3 -122.7 5860
## 
##   sigma:  4539
## 
##  AIC AICc  BIC 
## 2613 2620 2663
fit3$model
## Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = kcts, h = 36, seasonal = "multiplicative") 
## 
##   Smoothing parameters:
##     alpha = 0.2291 
##     beta  = 0.1064 
##     gamma = 0.0001 
## 
##   Initial states:
##     l = 180382.5764 
##     b = -474.5543 
##     s = 1.057 1.087 1.039 0.9884 0.964 0.9218
##            0.934 0.9923 0.994 0.9926 0.9958 1.034
## 
##   sigma:  0.0297
## 
##  AIC AICc  BIC 
## 2649 2655 2696
fit4$model
## Damped Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = kcts, h = 36, seasonal = "multiplicative", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.4189 
##     beta  = 0.0353 
##     gamma = 0.0001 
##     phi   = 0.9695 
## 
##   Initial states:
##     l = 180382.6574 
##     b = -982.3854 
##     s = 1.056 1.086 1.039 0.9883 0.9647 0.9224
##            0.932 0.9919 0.9948 0.9943 0.9958 1.034
## 
##   sigma:  0.0282
## 
##  AIC AICc  BIC 
## 2637 2643 2687

It is interesting that all of these models have an alpha under .5 and a beta of about .1 or less. This helps to show that these models use many lags back to help forecast future points.

Forecast Plots

Now that the models have been created let’s see how the models look visually.

kcts%>%autoplot +
  labs(title = "Holt-Winters Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="Additive"),fitted(fit1))+
  geom_line(aes(colour="Additive"),fit1$mean)+
  #geom_line(aes(colour="Additive damped trend"),fitted(fit2))+
  geom_line(aes(colour="Additive damped trend"),fit2$mean)+
  #geom_line(aes(colour="Multiplicative"),fitted(fit3))+
  geom_line(aes(colour="Multiplicative"),fit3$mean)+
  #geom_line(aes(colour="Multiplicative damped trend"),fitted(fit4))+
  geom_line(aes(colour="Multiplicative damped trend"),fit4$mean)+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

kcts%>%autoplot +
  labs(title = "Holt-Winters Forecasts of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="Additive"),fitted(fit1))+
  geom_line(aes(colour="Additive"),fit1$mean)+
  #geom_line(aes(colour="Additive damped trend"),fitted(fit2))+
  geom_line(aes(colour="Additive damped trend"),fit2$mean)+
  #geom_line(aes(colour="Multiplicative"),fitted(fit3))+
  geom_line(aes(colour="Multiplicative"),fit3$mean)+
  #geom_line(aes(colour="Multiplicative damped trend"),fitted(fit4))+
  geom_line(aes(colour="Multiplicative damped trend"),fit4$mean)+
  coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,275000))+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Of the four models, the additive damped, multiplicative, and multiplicative damped all produced similar results, with the additive increasing at a greater rate than the others. The slope of the additive forecast matches the slope of the past few years of data, where the other models are more conservative in their forecasts. But all of the models seem to capture the seasonal effect very well.

Accuracy

To determine which model will be best to use, we'll look at the AIC, AICc, and other accuracy measures of the models.

fit1$model$aic
## [1] 2614
fit2$model$aic
## [1] 2613
fit3$model$aic
## [1] 2649
fit4$model$aic
## [1] 2637
fit1$model$aicc
## [1] 2620
fit2$model$aicc
## [1] 2620
fit3$model$aicc
## [1] 2655
fit4$model$aicc
## [1] 2643
kable(accuracy(fit1))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 316.9 4257 3309 0.1632 1.923 0.3108 -0.075
kable(accuracy(fit2))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 390.6 4205 3328 0.1719 1.922 0.3126 -0.033
kable(accuracy(fit3))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 14.97 4743 3752 -0.0113 2.145 0.3525 0.1892
kable(accuracy(fit4))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 479.2 4406 3537 0.2269 2.039 0.3322 -0.0468

The AIC is slightly lower for the additive damped model, and the AICc is essentially tied between the additive and additive damped models. The additive damped model also has a lower RMSE, but the additive has a lower MAE. It is difficult to choose the best model from this, so I am going to use the ETS model to aid in the selection and see which one it recommends.
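One way to make this comparison easier to read is to collect the information criteria into a single table; a sketch of the idea, using the built-in AirPassengers series as a stand-in for the kcts data:

```r
library(forecast)
library(knitr)

# Fit the same four Holt-Winters variants used above, then tabulate
# their information criteria for a side-by-side comparison.
fits <- list(
  "additive"              = hw(AirPassengers, seasonal = "additive", h = 36),
  "additive damped"       = hw(AirPassengers, seasonal = "additive", damped = TRUE, h = 36),
  "multiplicative"        = hw(AirPassengers, seasonal = "multiplicative", h = 36),
  "multiplicative damped" = hw(AirPassengers, seasonal = "multiplicative", damped = TRUE, h = 36)
)

ic <- t(sapply(fits, function(f) c(AIC  = f$model$aic,
                                   AICc = f$model$aicc,
                                   BIC  = f$model$bic)))
kable(ic[order(ic[, "AICc"]), ])  # lowest AICc first
```

Applied to the fit1 through fit4 objects above, the same pattern would put all the criteria in one table instead of eight separate printouts.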

ETS

The ETS framework can choose from every Holt-Winters model and more, because it allows multiplicative error, not just the additive error that Holt-Winters uses. But because the time series is not exponential, the ETS model will effectively be selecting from the same options that Holt-Winters provided.

Models

For the ETS models I decided to let the function pick the best model to use: the first model is selected based on AIC, and the second model uses the default parameters.

fit5 <- ets(kcts, ic=c("aic"))

fit5
## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = kcts, ic = c("aic")) 
## 
##   Smoothing parameters:
##     alpha = 0.2852 
##     beta  = 0.098 
##     gamma = 0.0001 
##     phi   = 0.9006 
## 
##   Initial states:
##     l = 180374.8841 
##     b = -991.5614 
##     s = 11060 15997 6922 -2288 -6449 -14418
##            -12715 -1759 -1151 -937.3 -122.7 5860
## 
##   sigma:  4539
## 
##  AIC AICc  BIC 
## 2613 2620 2663
fit6 <- ets(kcts)

fit6
## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = kcts) 
## 
##   Smoothing parameters:
##     alpha = 0.2852 
##     beta  = 0.098 
##     gamma = 0.0001 
##     phi   = 0.9006 
## 
##   Initial states:
##     l = 180374.8841 
##     b = -991.5614 
##     s = 11060 15997 6922 -2288 -6449 -14418
##            -12715 -1759 -1151 -937.3 -122.7 5860
## 
##   sigma:  4539
## 
##  AIC AICc  BIC 
## 2613 2620 2663

Based on AIC, the ETS model selected has additive error, additive damped trend, and additive seasonality, which matches the Holt-Winters additive damped model. Based on the default settings, the ETS model selected has additive error, additive trend, and additive seasonality. It is good to see that the two models selected are the same two models being looked at earlier, though it would have been nice to have a clear-cut winner between the two; it does seem like we are looking in the right place.

Forecast Plots

Now that we have two ETS models created let’s see how their forecasts look.

fc5<-forecast(fit5, h=36)
fc6<-forecast(fit6, h=36)

kcts%>%autoplot + 
   labs(title = "ETS Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="ETS(AAdA)"),fitted(fit5))+
  geom_line(aes(colour="ETS(AAdA)"),fc5$mean)+
  #geom_line(aes(colour="ETS(AAA)"),fitted(fit6))+
  geom_line(aes(colour="ETS(AAA)"),fc6$mean)+
  theme_minimal() + 
  scale_colour_calc(breaks= c("ETS(AAdA)","ETS(AAA)")) +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

kcts%>%autoplot + 
   labs(title = "ETS Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="ETS(AAdA)"),fitted(fit5))+
  geom_line(aes(colour="ETS(AAdA)"),fc5$mean)+
  #geom_line(aes(colour="ETS(AAA)"),fitted(fit6))+
  geom_line(aes(colour="ETS(AAA)"),fc6$mean)+
  theme_minimal() + 
  scale_colour_calc(breaks= c("ETS(AAdA)","ETS(AAA)")) +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+
    coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,275000))

Just like with the Holt-Winters models, the ETS(A,A,A) forecast really takes off on what appears to be a similar trajectory to the past few years of data, where the ETS(A,Ad,A) model predicts a forecast that stays about the same across all three years.

Accuracy

To help pick between these two models, I will look at their accuracy yet again.

fit5$aic
## [1] 2613
fit6$aic
## [1] 2613
kable(accuracy(fit5))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 390.4 4205 3327 0.1718 1.922 0.3125 -0.0331
kable(accuracy(fit6))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 390.4 4205 3327 0.1718 1.922 0.3125 -0.0331

Since the first ETS model was selected by its AIC, it makes sense that it has the better AIC of the two, but they are both very close. The first model also has a slightly better RMSE, while the second model has the better MAE.

Even though they are very close, I have decided to use the second ETS model as my forecast pick between the Holt-Winters and ETS models. I feel it could just as easily have been the other model, but looking beyond the time series data, home prices have historically risen year over year, so I have selected the model that best reflects that.
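To put the comparison in one place, the AIC values and training accuracy of the two fits can be collected into a single table; this is a small sketch assuming the fit5 and fit6 objects from above.

```r
# Side-by-side comparison of the two ETS fits (fit5 and fit6 as above).
c("ETS(A,Ad,A)" = fit5$aic, "ETS(A,A,A)" = fit6$aic)

tab <- rbind(accuracy(fit5), accuracy(fit6))
rownames(tab) <- c("ETS(A,Ad,A)", "ETS(A,A,A)")
kable(tab)
```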

Forecast with Confidence Intervals

Here is what the forecast for the ETS(A,A,A) model looks like with confidence intervals.

fc6%>%autoplot+
   labs(title = "ETS(AAA) Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))

Even though it was tough to pick between the two models, looking at the confidence intervals does make me feel better about my choice, because the possibility that home prices decline is within the 95% confidence intervals that are listed.
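The interval bounds can also be read off the forecast object directly; this is a quick sketch assuming the fc6 and kcts objects from above.

```r
# Check whether a price decline falls inside the 95% interval at the
# end of the 36-month horizon (fc6 and kcts as above).
h_end <- length(fc6$mean)
fc6$lower[h_end, "95%"]                 # lower 95% bound 36 months out
fc6$upper[h_end, "95%"]                 # upper 95% bound
fc6$lower[h_end, "95%"] < tail(kcts, 1) # TRUE if a decline is in the interval
```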

ARIMA

For the next set of time series forecast models, I am going to look at the different ARIMA models. Through the analysis of the dataset, it was determined that once the data is differenced it becomes stationary, which the ARIMA model can account for directly through its differencing term.

12th Difference Plots

Even though we have already looked at two different plots of the dataset, I need to transform it once more by taking the lag-12 difference of the differenced dataset. This helps account for the seasonality so that we can start to estimate the ARIMA models.

diff(diff(kcts),12)%>% autoplot()+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
  labs(title = "Differced House Prices Differenced to 12th Period",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

The new time plot has a similar shape to the original differenced data, but taking the lag-12 difference also changes it noticeably. The data still looks stationary, but the spikes seem more drastic in this plot.

12 Difference Plot ACF and PACF Plots

To start estimating the ARIMA models, we will look at the ACF and PACF plots of the twice-differenced data to help us understand the autoregressive and moving average components of the model along with the seasonality.

p1 <- ggAcf(diff(diff(kcts),12)) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF Plots of Differenced Average Kansas City Home Selling Price')
  
p2<-ggPacf(diff(diff(kcts),12)) + theme_minimal()+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')

multiplot(p1,p2, cols =1)

Starting with the seasonality, there still appear to be big spikes at the 12th lag, so the lag-12 difference was not enough by itself to take care of the seasonality. Since we see those big spikes on both the ACF and PACF plots, I think that both an autoregressive term of 1, or (1,0,0)12, and a moving average term of 1, or (0,0,1)12, should be tried to account for the seasonal effect.

Now for the non-seasonal part of the model: since the data was differenced once, we know the model will have a difference of 1. The PACF plot has significant lags out to the second position, which suggests an autoregressive effect of 2 lags, or (2,1,0), and the ACF plot has a significant lag at the first position, which suggests a moving average effect of 1 lag, or (0,1,1). It is also possible that both are needed, for an ARIMA of (2,1,1).

Now, combining those two parts of the model, there are 6 models to test:

(2,1,0)(1,0,0)12, (2,1,0)(0,0,1)12, (0,1,1)(1,0,0)12, (0,1,1)(0,0,1)12, (2,1,1)(1,0,0)12, (2,1,1)(0,0,1)12

Models

Now that we have identified the models we want to take a look at, let’s create the models.

fit7<-arima(kcts, c(2,1,0), c(1,0,0))
fit8<-arima(kcts, c(2,1,0), c(0,0,1))
fit9<-arima(kcts, c(0,1,1), c(1,0,0))
fit10<-arima(kcts, c(0,1,1), c(0,0,1))
fit11<-arima(kcts, c(2,1,1), c(1,0,0))
fit12<-arima(kcts, c(2,1,1), c(0,0,1))

fit7
## 
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(1, 0, 0))
## 
## Coefficients:
##          ar1     ar2   sar1
##       -0.610  -0.172  0.773
## s.e.   0.102   0.096  0.059
## 
## sigma^2 estimated as 35412228:  log likelihood = -1209,  aic = 2426
fit8
## 
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(0, 0, 1))
## 
## Coefficients:
##          ar1    ar2   sma1
##       -0.199  0.012  0.358
## s.e.   0.109  0.093  0.077
## 
## sigma^2 estimated as 54393952:  log likelihood = -1230,  aic = 2467
fit9
## 
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(1, 0, 0))
## 
## Coefficients:
##          ma1   sar1
##       -0.541  0.745
## s.e.   0.081  0.060
## 
## sigma^2 estimated as 36447527:  log likelihood = -1210,  aic = 2426
fit10
## 
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(0, 0, 1))
## 
## Coefficients:
##          ma1   sma1
##       -0.198  0.354
## s.e.   0.109  0.076
## 
## sigma^2 estimated as 54473971:  log likelihood = -1230,  aic = 2465
fit11
## 
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(1, 0, 0))
## 
## Coefficients:
##         ar1    ar2     ma1   sar1
##       0.258  0.309  -0.834  0.737
## s.e.  0.334  0.226   0.282  0.063
## 
## sigma^2 estimated as 35751300:  log likelihood = -1209,  aic = 2427
fit12
## 
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(0, 0, 1))
## 
## Coefficients:
##         ar1    ar2     ma1   sma1
##       0.611  0.042  -0.855  0.343
## s.e.  0.132  0.115   0.073  0.081
## 
## sigma^2 estimated as 52236660:  log likelihood = -1227,  aic = 2464

Looking at just the AIC of the models, the ones with the seasonal autoregressive term (AIC around 2426) come out ahead of the ones with the seasonal moving average term (AIC around 2465).

Forecast Plots

Now that the models are created let’s see how all models look plotted together.

fc7<-forecast(fit7,h=36)
fc8<-forecast(fit8,h=36)
fc9<-forecast(fit9,h=36)
fc10<-forecast(fit10,h=36)
fc11<-forecast(fit11,h=36)
fc12<-forecast(fit12,h=36)

kcts%>%autoplot +
  labs(title = "ARIMA Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fitted(fit7))+
  geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fc7$mean)+
  #geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fitted(fit8))+
  geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fc8$mean)+
  #geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fitted(fit9))+
  geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fc9$mean)+
  #geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fitted(fit10))+
  geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fc10$mean)+
  #geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fitted(fit11))+
  geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fc11$mean)+
  #geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fitted(fit12))+
  geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fc12$mean)+
  theme_minimal() + 
  scale_colour_calc(breaks=c("ARIMA(2,1,0)(1,0,0)12","ARIMA(2,1,0)(0,0,1)12","ARIMA(0,1,1)(1,0,0)12","ARIMA(0,1,1)(0,0,1)12","ARIMA(2,1,1)(1,0,0)12","ARIMA(2,1,1)(0,0,1)12")) +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

kcts%>%autoplot +
  labs(title = "ARIMA Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
  #geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fitted(fit7))+
  geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fc7$mean)+
  #geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fitted(fit8))+
  geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fc8$mean)+
  #geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fitted(fit9))+
  geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fc9$mean)+
  #geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fitted(fit10))+
  geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fc10$mean)+
  #geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fitted(fit11))+
  geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fc11$mean)+
  #geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fitted(fit12))+
  geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fc12$mean)+
  theme_minimal() + 
  scale_colour_calc(breaks=c("ARIMA(2,1,0)(1,0,0)12","ARIMA(2,1,0)(0,0,1)12","ARIMA(0,1,1)(1,0,0)12","ARIMA(0,1,1)(0,0,1)12","ARIMA(2,1,1)(1,0,0)12","ARIMA(2,1,1)(0,0,1)12")) +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+
    coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,262500))

It’s interesting that graphing the models together reveals some striking differences. The three models with the seasonal autoregressive effect look a lot like the other forecasts we have seen, but for the models with the seasonal moving average effect the forecast really dies off quickly. It will be interesting to see how all the models perform, but for forecasting 3 years out, we may have to use the seasonal autoregressive models by default.

ARIMA Residual ACF Plots

Now that we have seen the forecast plots, I want to look at the ACF plots of the models' residuals to see whether all the autocorrelation in the residuals has been accounted for.

p1<-fit7 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(2,1,0)(1,0,0)12 Model", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p2<-fit8 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(2,1,0)(0,0,1)12 Model", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p3<-fit9 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(0,1,1)(1,0,0)12 Model", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p4<-fit10 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(0,1,1)(0,0,1)12 Model", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p5<-fit11 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(2,1,1)(1,0,0)12", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p6<-fit12 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMA(2,1,1)(0,0,1)12", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

multiplot(p1,p2,p3,p4,p5,p6,cols=1)

Looking at the residual ACF plots, it appears that no model accounts for all the autocorrelation. That is unfortunate for the time series alone, but it may be accounted for once we bring in the other variables.

ARIMA Residual Box-Ljung Tests

To confirm that not all the autocorrelation has been accounted for, I will run the Box-Ljung test on the residuals of each of the ARIMA models we have created.

Box.test(residuals(fit7), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit7)
## X-squared = 50, df = 24, p-value = 0.002
Box.test(residuals(fit8), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit8)
## X-squared = 120, df = 24, p-value = 7e-15
Box.test(residuals(fit9), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit9)
## X-squared = 65, df = 24, p-value = 1e-05
Box.test(residuals(fit10), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit10)
## X-squared = 120, df = 24, p-value = 2e-15
Box.test(residuals(fit11), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit11)
## X-squared = 56, df = 24, p-value = 0.0002
Box.test(residuals(fit12), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit12)
## X-squared = 110, df = 24, p-value = 7e-13

In all the Box-Ljung tests we rejected the null hypothesis, meaning there is still autocorrelation left in the residuals.
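The six tests can also be run in one pass; this is a compact sketch assuming the fit7 through fit12 objects from above.

```r
# Collect the Ljung-Box p-values for all six ARIMA fits at once.
fits <- list(fit7, fit8, fit9, fit10, fit11, fit12)
pvals <- sapply(fits, function(f)
  Box.test(residuals(f), lag = 24, fitdf = 0, type = "Lj")$p.value)
names(pvals) <- paste0("fit", 7:12)
pvals  # all well below 0.05, so we reject the null for every model
```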

Accuracy

Since there is still autocorrelation left in the residuals, we cannot pick the best model by which one accounts for all the autocorrelation; we will have to look at the accuracy of the models instead.

summary(fit7)
## 
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(1, 0, 0))
## 
## Coefficients:
##          ar1     ar2   sar1
##       -0.610  -0.172  0.773
## s.e.   0.102   0.096  0.059
## 
## sigma^2 estimated as 35412228:  log likelihood = -1209,  aic = 2426
## 
## Training set error measures:
##                 ME RMSE  MAE    MPE  MAPE   MASE      ACF1
## Training set 342.8 5926 4624 0.1159 2.704 0.7179 -0.009737
summary(fit8)
## 
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(0, 0, 1))
## 
## Coefficients:
##          ar1    ar2   sma1
##       -0.199  0.012  0.358
## s.e.   0.109  0.093  0.077
## 
## sigma^2 estimated as 54393952:  log likelihood = -1230,  aic = 2467
## 
## Training set error measures:
##                 ME RMSE  MAE     MPE  MAPE   MASE      ACF1
## Training set 348.3 7344 5751 0.06123 3.325 0.8927 0.0006566
summary(fit9)
## 
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(1, 0, 0))
## 
## Coefficients:
##          ma1   sar1
##       -0.541  0.745
## s.e.   0.081  0.060
## 
## sigma^2 estimated as 36447527:  log likelihood = -1210,  aic = 2426
## 
## Training set error measures:
##                 ME RMSE  MAE   MPE  MAPE   MASE     ACF1
## Training set 431.4 6012 4552 0.151 2.674 0.7067 -0.07133
summary(fit10)
## 
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(0, 0, 1))
## 
## Coefficients:
##          ma1   sma1
##       -0.198  0.354
## s.e.   0.109  0.076
## 
## sigma^2 estimated as 54473971:  log likelihood = -1230,  aic = 2465
## 
## Training set error measures:
##                 ME RMSE  MAE     MPE  MAPE   MASE      ACF1
## Training set 367.4 7350 5727 0.06607 3.311 0.8891 -0.001585
summary(fit11)
## 
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(1, 0, 0))
## 
## Coefficients:
##         ar1    ar2     ma1   sar1
##       0.258  0.309  -0.834  0.737
## s.e.  0.334  0.226   0.282  0.063
## 
## sigma^2 estimated as 35751300:  log likelihood = -1209,  aic = 2427
## 
## Training set error measures:
##                 ME RMSE  MAE    MPE  MAPE   MASE     ACF1
## Training set 503.2 5954 4533 0.1899 2.653 0.7038 -0.03588
summary(fit12)
## 
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(0, 0, 1))
## 
## Coefficients:
##         ar1    ar2     ma1   sma1
##       0.611  0.042  -0.855  0.343
## s.e.  0.132  0.115   0.073  0.081
## 
## sigma^2 estimated as 52236660:  log likelihood = -1227,  aic = 2464
## 
## Training set error measures:
##                 ME RMSE  MAE   MPE  MAPE   MASE      ACF1
## Training set 641.4 7197 5563 0.175 3.212 0.8636 0.0009338

Looking at AIC, the (2,1,0)(1,0,0)12 and (0,1,1)(1,0,0)12 models have the lowest AIC of the six, and between those two the first has the lower RMSE. The third model with the seasonal autoregressive term also has a good AIC and RMSE, but if I had to choose a model I would go with (2,1,0)(1,0,0)12.
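Rather than reading the numbers out of six summary printouts, the AIC and training RMSE can be lined up in one data frame; a sketch assuming the fit objects from above.

```r
# Compare AIC and training RMSE across the six ARIMA fits.
fits <- list(fit7, fit8, fit9, fit10, fit11, fit12)
comparison <- data.frame(
  model = paste0("fit", 7:12),
  AIC   = sapply(fits, AIC),
  RMSE  = sapply(fits, function(f) accuracy(f)["Training set", "RMSE"])
)
comparison[order(comparison$AIC), ]  # sorted best AIC first
```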

Auto ARIMA

But just to make sure I picked a reasonable model, I am going to use the auto.arima function to see which model it selects.

fit13<-auto.arima(kcts)

fit13
## Series: kcts 
## ARIMA(1,1,1)(0,1,1)[12] 
## 
## Coefficients:
##          ar1     ma1    sma1
##       -0.261  -0.345  -0.833
## s.e.   0.149   0.136   0.148
## 
## sigma^2 estimated as 24171419:  log likelihood=-1067
## AIC=2142   AICc=2142   BIC=2153

The auto.arima function actually selected a different model, ARIMA(1,1,1)(0,1,1)[12], which also applies a seasonal difference. Its AIC is not directly comparable to the earlier models, because the seasonal differencing changes the data being fit, so I will stay with the (2,1,0)(1,0,0)12 model I selected above.
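To see how the auto.arima specification behaves over the same horizon, it can be forecast alongside the hand-picked model; a quick sketch assuming the fit13 and fc7 objects from above.

```r
# Forecast the auto.arima model 36 months out and compare its point
# forecasts against the hand-picked model's (fc7 as above).
fc13 <- forecast(fit13, h = 36)
summary(as.numeric(fc13$mean - fc7$mean))  # spread of the differences
```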

Forecast with Confidence Intervals

Now that I have selected an ARIMA model I want to plot it with the confidence intervals.

fc7%>%autoplot+
  labs(title = "ARIMA(2,1,0)(1,0,0)12 Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))

I feel the ARIMA model turned out very well; it doesn't exactly follow the slope of the data, but it still increases as time goes on. The seasonal effect seems to work well, and I like that the model was built on stationary data. Between this model, the Holt-Winters model, and the ETS model, I think this ARIMA model is the best to use to forecast the average selling price of homes in Kansas City.

Multi-Variable Time Series

Now that we have explored forecasting the average home selling price in Kansas City from the time series alone, I want to try to forecast it using other variables as well. In my mind, forecasting the time series by itself and forecasting it with other variables are two different exercises, so I decided to separate the analyses. I feel the single time series gave a good simple forecast, but now I want to see if other variables can help explain where home selling prices are going.

Time Series Variables

With the original dataset, I collected the number of home listings in Kansas City, the number of new housing permits in Kansas City, and the US Home Price Index (HPI). I will use these variables to see if they are helpful in predicting where home prices are going.

Variable Plots

The first thing I want to do is plot all the time series together to perform a basic visual analysis.

kcvarts <- ts(kchouse[,2:5], start=c(2007,8),frequency=12)

p1<-kcvarts[,1] %>%autoplot+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
   theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+
  labs(title = "Variable of KC House Data",subtitle="From 10/10/2016 to 10/9/2017",x = "Date", y = "Avg Price")
p2<-kcvarts[,2] %>%autoplot+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
   theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+labs(y = "Listing")
p3<-kcvarts[,3] %>%autoplot+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
   theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+labs(y = "Permit")
p4<-kcvarts[,4] %>%autoplot+
  theme_minimal() + 
  geom_line(aes(colour="price"))+
  scale_colour_calc() +
   theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+labs(y = "US HPI")


multiplot(p1,p2,p3,p4,cols=1)

The results are very interesting: the number of permits and the US HPI follow the original plot of the average home selling price very closely, but the number of listings behaves very differently. For the first part of the plot the number of listings follows the selling price, but once home prices pick up, the number of listings continues to drop.

Cross Correlation

Now that we have looked at the variables, let's see how they relate to the time series we are trying to forecast. To do this I am going to use the cross-correlation function to help me find out at which lags the different time series are most correlated.

ccf(kchouse[,2], kchouse[,3])

ccf(kchouse[,2], kchouse[,4])

ccf(kchouse[,2], kchouse[,5])

In terms of the average selling price of a home, the number of listings is most correlated with it five lags into the future. Other than that one variable, the series are most correlated with each other at a lag of 0, i.e. at the same point in time.
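The lag with the strongest cross-correlation can also be pulled out programmatically instead of read off the plot; a small sketch assuming the kchouse data frame used above.

```r
# Find the lag at which price and listings are most correlated.
cc <- ccf(kchouse[,2], kchouse[,3], plot = FALSE)
best <- which.max(abs(cc$acf))
cc$lag[best]  # lag of the strongest cross-correlation
cc$acf[best]  # the correlation at that lag
```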

TSLM

Now that we have the other variables and know when they are strongly correlated with each other, let's begin the forecasting with a time series linear model.

Models

To make sure we are on the right path using the other variables, I want to run two quick TSLM models: one with just trend and seasonal effects, and the other with trend, seasonal effects, and the other variables added in.

fit14 = tslm(kcvarts[,1]~trend+season)
fit15 = tslm(kcvarts[,1]~trend+season+kcvarts[,2:4])

summary(fit14)
## 
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24134  -8778   -247   5368  31834 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 135302.9     4457.3   30.36  < 2e-16 ***
## trend          527.2       33.4   15.80  < 2e-16 ***
## season2      -1882.3     5632.5   -0.33  0.73890    
## season3       5777.9     5632.8    1.03  0.30731    
## season4       9679.2     5633.3    1.72  0.08865 .  
## season5      19014.2     5634.0    3.37  0.00103 ** 
## season6      27784.5     5634.9    4.93    3e-06 ***
## season7      22110.0     5636.0    3.92  0.00015 ***
## season8      19623.9     5634.9    3.48  0.00072 ***
## season9      12076.0     5634.0    2.14  0.03434 *  
## season10     11865.2     5633.3    2.11  0.03752 *  
## season11     11910.1     5632.8    2.11  0.03680 *  
## season12     10993.1     5632.5    1.95  0.05358 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12600 on 107 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.716 
## F-statistic:   26 on 12 and 107 DF,  p-value: <2e-16
summary(fit15)
## 
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season + kcvarts[, 2:4])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11695  -2046    199   2374  10236 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                303.093  16498.132    0.02  0.98538    
## trend                      301.878     46.393    6.51  2.7e-09 ***
## season2                  -2521.122   1765.547   -1.43  0.15630    
## season3                   2717.296   1833.029    1.48  0.14126    
## season4                   5040.063   1864.488    2.70  0.00802 ** 
## season5                  12677.761   1906.967    6.65  1.4e-09 ***
## season6                  19997.275   1982.186   10.09  < 2e-16 ***
## season7                  15612.737   1944.469    8.03  1.6e-12 ***
## season8                  12040.430   1979.942    6.08  2.0e-08 ***
## season9                   5994.460   1917.345    3.13  0.00229 ** 
## season10                  6587.403   1893.137    3.48  0.00073 ***
## season11                  8799.967   1814.067    4.85  4.3e-06 ***
## season12                  9336.104   1795.753    5.20  1.0e-06 ***
## kcvarts[, 2:4]listing        0.546      0.485    1.12  0.26328    
## kcvarts[, 2:4]new_permit    91.530     43.604    2.10  0.03822 *  
## kcvarts[, 2:4]us_hpi       668.291     53.708   12.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3940 on 104 degrees of freedom
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.972 
## F-statistic:  278 on 15 and 104 DF,  p-value: <2e-16

In just this quick comparison the adjusted R-squared goes from .716 to .972 once we add in the other variables, so adding them definitely seems like it will be very beneficial.
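The adjusted R-squared values can be extracted directly for comparison; a one-line sketch assuming the fit14 and fit15 objects from above.

```r
# Pull the adjusted R-squared out of both TSLM fits.
c(trend_season   = summary(fit14)$adj.r.squared,
  with_variables = summary(fit15)$adj.r.squared)
```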

Variables Lag 5 Listing

Now that we know the other variables add value, I want to see the effect of lagging the average selling price, the number of permits, and the US HPI 5 periods behind the number of listings. To do this I create one set of variables where everything is lagged 5 periods behind listings, and then another set with just the last five periods removed, to compare against the lagged dataset.

kcvarts2<-as.data.frame(cbind(c(kcvarts[6:120,2]),
                 c(kcvarts[1:115,3]),
                 c(kcvarts[1:115,4])))
kcvarts2<-ts(kcvarts2, start=c(2007,8),frequency=12)


kcvarts3<-kcvarts[1:115,2:4]
kcvarts3<-ts(kcvarts3, start=c(2007,8),frequency=12)

Now that we have the new variables we can test for significance.

Models Lag 5 Listing

Using these new variables I want to create two new models to compare the effects of transforming the variables.

kcts2<-ts(kcts[1:115], start=c(2007,8),frequency=12)

fit16 = tslm(kcts2~trend+season+kcvarts2)
fit17 = tslm(kcts2~trend+season+kcvarts3)

summary(fit16)
## 
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10592  -2259    433   2301   9500 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21616.247  17522.117   -1.23  0.22025    
## trend          346.698     44.139    7.85  4.9e-12 ***
## season2      -2387.238   1757.493   -1.36  0.17745    
## season3       3639.881   1906.259    1.91  0.05910 .  
## season4       6740.232   1984.819    3.40  0.00099 ***
## season5      14432.873   2006.843    7.19  1.2e-10 ***
## season6      22930.810   2246.747   10.21  < 2e-16 ***
## season7      19851.995   2257.701    8.79  4.7e-14 ***
## season8      15679.496   2262.808    6.93  4.3e-10 ***
## season9       9064.712   2107.709    4.30  4.0e-05 ***
## season10      8980.673   1991.553    4.51  1.8e-05 ***
## season11      9968.356   1789.882    5.57  2.2e-07 ***
## season12      9726.255   1792.875    5.42  4.1e-07 ***
## kcvarts2V1       1.037      0.469    2.21  0.02925 *  
## kcvarts2V2      59.940     41.444    1.45  0.15125    
## kcvarts2V3     733.452     56.911   12.89  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3920 on 99 degrees of freedom
## Multiple R-squared:  0.971,  Adjusted R-squared:  0.967 
## F-statistic:  225 on 15 and 99 DF,  p-value: <2e-16
summary(fit17)
## 
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11742  -2070    287   2538  10341 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -4119.699  17012.587   -0.24  0.80916    
## trend                307.522     47.415    6.49  3.5e-09 ***
## season2            -2500.867   1788.692   -1.40  0.16519    
## season3             2948.504   1917.840    1.54  0.12738    
## season4             5364.122   1967.188    2.73  0.00757 ** 
## season5            12788.887   2005.545    6.38  5.8e-09 ***
## season6            20361.004   2077.974    9.80  3.0e-16 ***
## season7            16300.389   2024.884    8.05  1.9e-12 ***
## season8            12201.115   2016.064    6.05  2.6e-08 ***
## season9             6108.426   1949.000    3.13  0.00227 ** 
## season10            6715.149   1924.518    3.49  0.00072 ***
## season11            8803.415   1838.797    4.79  5.9e-06 ***
## season12            9440.739   1820.896    5.18  1.1e-06 ***
## kcvarts3listing        0.587      0.497    1.18  0.23971    
## kcvarts3new_permit    76.925     45.552    1.69  0.09442 .  
## kcvarts3us_hpi       690.064     56.939   12.12  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3990 on 99 degrees of freedom
## Multiple R-squared:  0.97,   Adjusted R-squared:  0.966 
## F-statistic:  217 on 15 and 99 DF,  p-value: <2e-16

The adjusted R-squared values of the models are very close to each other, so it may not be worth going through the trouble of lagging the data 5 periods behind listings.

Significant Variables Lag 5 Listing

To see whether it is actually worth lagging the data 5 periods behind listings, I decided to remove the variables that were not significant in each model. In the first model the number of new permits was not significant, and in the second model the number of listings was not significant. So I will create two more sets of variables to test against.

kcvarts4<-as.data.frame(cbind(c(kcvarts[6:120,2]),
                 c(kcvarts[1:115,4])))
kcvarts4<-ts(kcvarts4, start=c(2007,8),frequency=12)


kcvarts5<-kcvarts[1:115,3:4]
kcvarts5<-ts(kcvarts5, start=c(2007,8),frequency=12)

Models Significant Variables Lag 5 Listing

Now that the new variables are ready we can test the two models again.

fit18 = tslm(kcts2~trend+season+kcvarts4)
fit19 = tslm(kcts2~trend+season+kcvarts5)

summary(fit18)
## 
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10612  -2327    282   2508   9287 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -40024.248  12108.461   -3.31  0.00132 ** 
## trend          384.883     35.565   10.82  < 2e-16 ***
## season2      -2265.855   1765.045   -1.28  0.20220    
## season3       4524.944   1815.193    2.49  0.01432 *  
## season4       7937.450   1813.786    4.38  3.0e-05 ***
## season5      15702.233   1814.577    8.65  8.8e-14 ***
## season6      24814.489   1840.702   13.48  < 2e-16 ***
## season7      21502.268   1958.749   10.98  < 2e-16 ***
## season8      17509.769   1886.070    9.28  3.7e-15 ***
## season9      10533.613   1856.928    5.67  1.4e-07 ***
## season10     10241.275   1800.444    5.69  1.3e-07 ***
## season11     10399.355   1774.509    5.86  5.9e-08 ***
## season12     10244.173   1766.313    5.80  7.8e-08 ***
## kcvarts4V1       1.420      0.389    3.65  0.00042 ***
## kcvarts4V2     803.381     30.182   26.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3950 on 100 degrees of freedom
## Multiple R-squared:  0.971,  Adjusted R-squared:  0.967 
## F-statistic:  238 on 14 and 100 DF,  p-value: <2e-16
summary(fit19)
## 
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts5)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11826  -2119     70   2703   9615 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         14475.9     6514.8    2.22  0.02854 *  
## trend                 254.0       14.1   18.04  < 2e-16 ***
## season2             -2430.7     1791.3   -1.36  0.17784    
## season3              3007.1     1921.0    1.57  0.12065    
## season4              5633.9     1957.8    2.88  0.00490 ** 
## season5             13248.7     1971.4    6.72  1.1e-09 ***
## season6             20744.5     2056.6   10.09  < 2e-16 ***
## season7             17086.9     1916.4    8.92  2.3e-14 ***
## season8             12923.1     1925.3    6.71  1.2e-09 ***
## season9              6771.1     1870.5    3.62  0.00046 ***
## season10             7268.8     1870.5    3.89  0.00018 ***
## season11             9283.9     1796.9    5.17  1.2e-06 ***
## season12             9154.9     1808.4    5.06  1.9e-06 ***
## kcvarts5new_permit    111.7       34.9    3.20  0.00182 ** 
## kcvarts5us_hpi        641.7       39.7   16.15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4000 on 100 degrees of freedom
## Multiple R-squared:  0.97,   Adjusted R-squared:  0.966 
## F-statistic:  231 on 14 and 100 DF,  p-value: <2e-16

Again we get two models that are very close in terms of adjusted R-squared: the first model has .967 and the second has .966. It is also worth noting that both models have a p-value well under .05, meaning both models are significant.

Final Model

Between the two models, I have decided to use the number of new permits and the US HPI. Even though the other model performed slightly better, I don't think a .001 increase in adjusted R-squared is worth waiting 5 months to predict the average home selling price after the fact. Now that I am using just permits and US HPI, I want to fit the model one more time on the entire time series, without removing the last five periods.

fit20 = tslm(kcvarts[,1]~trend+season+kcvarts[,c(3,4)])

summary(fit20)
## 
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season + kcvarts[, c(3, 
##     4)])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11780  -2022   -122   2470   9552 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   17720.1     5699.7    3.11  0.00242 ** 
## trend                           252.0       13.8   18.30  < 2e-16 ***
## season2                       -2455.8     1766.8   -1.39  0.16748    
## season3                        2728.1     1835.3    1.49  0.14015    
## season4                        5283.8     1854.2    2.85  0.00527 ** 
## season5                       13078.5     1875.7    6.97  2.9e-10 ***
## season6                       20321.7     1963.6   10.35  < 2e-16 ***
## season7                       16379.0     1823.5    8.98  1.2e-14 ***
## season8                       12710.6     1890.5    6.72  9.5e-10 ***
## season9                        6609.8     1840.0    3.59  0.00050 ***
## season10                       7101.1     1839.5    3.86  0.00020 ***
## season11                       9246.9     1772.3    5.22  9.2e-07 ***
## season12                       9070.2     1782.4    5.09  1.6e-06 ***
## kcvarts[, c(3, 4)]new_permit    123.9       32.8    3.78  0.00026 ***
## kcvarts[, c(3, 4)]us_hpi        622.7       35.3   17.66  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3950 on 105 degrees of freedom
## Multiple R-squared:  0.975,  Adjusted R-squared:  0.972 
## F-statistic:  297 on 14 and 105 DF,  p-value: <2e-16

I am really happy with the way the TSLM turned out. In the LM model format, we can use the seasonal coefficients to confirm that February is indeed the month with the lowest home selling price and June the highest. You can also see that as the US HPI increases so do house prices, but interestingly home prices increase with new permits as well, which is the opposite of what I would expect.
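Reading the lowest and highest months off the seasonal dummies generalizes easily. As a toy illustration (plain lm with month dummies standing in for tslm's trend + season terms, run on simulated data rather than the KC series):

```r
set.seed(7)

# toy monthly series: trend + sinusoidal seasonal pattern + noise
n     <- 120
month <- factor(rep(month.abb, length.out = n), levels = month.abb)
y     <- 100 + 0.5 * (1:n) + 10 * sin(2 * pi * (1:n) / 12) + rnorm(n)

# same structure as tslm(y ~ trend + season): January is the baseline month
fit  <- lm(y ~ I(1:n) + month)
seas <- coef(fit)[grep("^month", names(coef(fit)))]

names(which.max(seas))   # month with the largest seasonal lift over January
names(which.min(seas))   # month with the largest seasonal drop
```

Each dummy coefficient is that month's average level relative to the baseline month, which is exactly how the February-low / June-high reading above works.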

Durbin-Watson Test

Now that we have a new model, I want to check whether its residuals show any autocorrelation.

dwtest(fit20, alt="two.sided")
## 
##  Durbin-Watson test
## 
## data:  fit20
## DW = 1.8, p-value = 0.2
## alternative hypothesis: true autocorrelation is not 0

In the Durbin-Watson test the p-value is greater than .05, so we fail to reject the null hypothesis; there is no significant autocorrelation in the residuals.
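For reference, the Durbin-Watson statistic is just the ratio of the summed squared differences of successive residuals to their summed squares, and values near 2 indicate no first-order autocorrelation (our 1.8 is comfortably close). A minimal base-R sketch of the computation:

```r
# Durbin-Watson statistic by hand: sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Near 2: no first-order autocorrelation; near 0: strong positive
# autocorrelation; near 4: strong negative autocorrelation.
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

dw_stat(c(1, -1, 1, -1))   # perfectly alternating residuals -> 3
set.seed(1)
dw_stat(rnorm(200))        # white noise lands near 2
```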

Forecast Variables

Now that we have a model, I wanted to get an idea of what a forecast may look like, so I used the Auto ARIMA model to forecast future values of new permits and the US HPI. With that said, I'm not going to compare this with the other forecasts I have created, because the error in this forecast would be much greater, but I thought it was an exercise worth trying.

kcpermit<-forecast(auto.arima(kcvarts[,3]),h=36)
kcus_hpi<-forecast(auto.arima(kcvarts[,4]),h=36)

kcvarts6<-as.data.frame(cbind(rep(0,36),rep(0,36),c(kcpermit$mean),c(kcus_hpi$mean)))

kcvarts<-ts(kcvarts6, start=c(2017,8),frequency=12)

Forecast with Confidence Intervals

Now that I have forecasted values for the other variables, I can create a forecast based on them.

fc20 <- forecast(fit20, newdata=kcvarts)

fc20%>%autoplot+
  labs(title = "TSLM Forecast Using Permits & US HPI",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))

Overall the new forecast looks very good. If I were using real predictions of new housing permits and US HPI, I think I could create a fairly close forecast of where the Kansas City average home selling price will be.

ARIMAX

Now that I have created a TSLM model, I wanted to create an ARIMAX model. Since lagging everything behind listings did not seem to be very productive in the last model, I decided to skip that step for the ARIMAX model.

Models

Unfortunately the ARIMAX will not tell us which variables are significant the way the TSLM does, so I will run all 7 combinations of the 3 variables through the Auto ARIMA function to see which model is the most accurate.

kcvarts <- ts(kchouse[,2:5], start=c(2007,8),frequency=12)

fit22<-auto.arima(kcts,xreg = kcvarts[,2])
fit23<-auto.arima(kcts,xreg = kcvarts[,3])
fit24<-auto.arima(kcts,xreg = kcvarts[,4])
fit25<-auto.arima(kcts,xreg = kcvarts[,2:3])
fit26<-auto.arima(kcts,xreg = kcvarts[,c(2,4)])
fit27<-auto.arima(kcts,xreg = kcvarts[,3:4])
fit28<-auto.arima(kcts,xreg = kcvarts[,2:4])
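The seven fits above enumerate every non-empty subset of the three candidate regressors. With more variables the same enumeration could be generated programmatically rather than typed out; a base-R sketch using combn (the auto.arima calls themselves are left out here):

```r
vars <- c("listing", "new_permit", "us_hpi")

# every non-empty subset of the candidate regressors: 2^3 - 1 = 7 combinations
combos <- unlist(lapply(seq_along(vars),
                        function(k) combn(vars, k, simplify = FALSE)),
                 recursive = FALSE)

length(combos)                          # 7
sapply(combos, paste, collapse = "+")   # each could be fed in via xreg = kcvarts[, combo]
```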

fit22
## Series: kcts 
## Regression with ARIMA(1,1,1)(0,1,1)[12] errors 
## 
## Coefficients:
##          ar1     ma1    sma1   xreg
##       -0.268  -0.341  -0.816  0.712
## s.e.   0.149   0.137   0.138  0.910
## 
## sigma^2 estimated as 24472982:  log likelihood=-1067
## AIC=2143   AICc=2144   BIC=2157
fit23
## Series: kcts 
## Regression with ARIMA(2,1,1)(2,0,0)[12] errors 
## 
## Coefficients:
##         ar1    ar2     ma1   sar1   sar2  drift    xreg
##       0.136  0.087  -0.675  0.359  0.466  319.3  153.17
## s.e.  0.320  0.187   0.283  0.082  0.093  719.5   72.04
## 
## sigma^2 estimated as 27885348:  log likelihood=-1192
## AIC=2400   AICc=2401   BIC=2422
fit24
## Series: kcts 
## Regression with ARIMA(1,1,1)(2,1,1)[12] errors 
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ma1    sar1    sar2    sma1    xreg
##       -0.047  -0.801  -0.366  -0.053  -0.444  875.60
## s.e.   0.001     NaN     NaN     NaN     NaN   90.44
## 
## sigma^2 estimated as 21861939:  log likelihood=-1057
## AIC=2129   AICc=2130   BIC=2148
fit25
## Series: kcts 
## Regression with ARIMA(2,0,0)(2,0,0)[12] errors 
## 
## Coefficients:
##         ar1    ar2   sar1   sar2  intercept  listing  new_permit
##       0.545  0.401  0.372  0.448     178026    0.509      117.07
## s.e.  0.090  0.091  0.082  0.087      32109    1.004       64.75
## 
## sigma^2 estimated as 29005226:  log likelihood=-1206
## AIC=2427   AICc=2428   BIC=2449
fit26
## Series: kcts 
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors 
## 
## Coefficients:
##          ma1    sar1  listing  us_hpi
##       -0.834  -0.573    0.592   981.8
## s.e.   0.057   0.080    0.537   119.0
## 
## sigma^2 estimated as 23104270:  log likelihood=-1060
## AIC=2130   AICc=2130   BIC=2143
fit27
## Series: kcts 
## Regression with ARIMA(1,1,0)(2,0,0)[12] errors 
## 
## Coefficients:
##          ar1   sar1   sar2  new_permit  us_hpi
##       -0.455  0.350  0.439       89.74   805.4
## s.e.   0.089  0.082  0.088       63.01   349.3
## 
## sigma^2 estimated as 28214224:  log likelihood=-1193
## AIC=2397   AICc=2398   BIC=2414
fit28
## Series: kcts 
## Regression with ARIMA(0,0,0)(2,1,0)[12] errors 
## 
## Coefficients:
##         sar1    sar2   drift  xreg.listing  xreg.new_permit  xreg.us_hpi
##       -0.685  -0.273  296.45         0.539            84.08       679.07
## s.e.   0.098   0.110   50.48         0.492            43.24        55.68
## 
## sigma^2 estimated as 22655118:  log likelihood=-1068
## AIC=2149   AICc=2150   BIC=2168

All 7 models ended up with a different ARIMA structure, and glancing at the AICs they split into two groups: models 1, 3, 5, and 7 come in around 2130 to 2150, while models 2, 4, and 6 sit near 2400, so the fits are not all equally accurate.

Residual ACF Plots with No Lag Listing

To see which model best accounts for the autocorrelation in the dataset, I created ACF plots of the residuals of every model to check for remaining autocorrelation.

p1<-fit22 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Listing", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p2<-fit23 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Permit", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p3<-fit24 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with US HPI", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p4<-fit25 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Listing & Permit", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p5<-fit26 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Listing & US HPI", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p6<-fit27 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Permit & US HPI", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p7<-fit28 %>% residuals %>% ggAcf + 
  theme_minimal() + 
  scale_colour_calc() +
  labs(title = "ARIMAX with Listing, Permit, & US HPI", x = "Time", y = "Residuals")+
  theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

multiplot(p1,p2,p3,p4,p5,p6,p7,cols=1)

Looking at the ACF plots, the 2nd and 4th models seem to be the only ones without any significant autocorrelation in the residuals, though the other models are very close. It is worth noting that this is better than the first ARIMA models I ran, where none of the models produced residuals free of autocorrelation.

Residual Box-Ljung Tests with No Lag Listing

Now that we have done the visual test for autocorrelation I want to run a statistical test for autocorrelation in the residuals.

Box.test(residuals(fit22), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit22)
## X-squared = 22, df = 24, p-value = 0.6
Box.test(residuals(fit23), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit23)
## X-squared = 17, df = 24, p-value = 0.9
Box.test(residuals(fit24), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit24)
## X-squared = 14, df = 24, p-value = 0.9
Box.test(residuals(fit25), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit25)
## X-squared = 18, df = 24, p-value = 0.8
Box.test(residuals(fit26), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit26)
## X-squared = 19, df = 24, p-value = 0.8
Box.test(residuals(fit27), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit27)
## X-squared = 27, df = 24, p-value = 0.3
Box.test(residuals(fit28), lag=24, fitdf=0, type="Lj")
## 
##  Box-Ljung test
## 
## data:  residuals(fit28)
## X-squared = 32, df = 24, p-value = 0.1

From the Box-Ljung tests, none of the seven models show statistically significant autocorrelation in the residuals at the .05 level, though model 7, with a p-value of 0.1, is the closest.
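As a sanity check on what the Box-Ljung test detects, a series with obvious autocorrelation fails it decisively, unlike the residuals above (a base-R illustration on simulated data, not the fitted models):

```r
set.seed(42)

# a random walk is strongly autocorrelated at every lag, so the
# Ljung-Box test rejects the no-autocorrelation null emphatically
rw <- cumsum(rnorm(120))
Box.test(rw, lag = 24, fitdf = 0, type = "Ljung-Box")$p.value
```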

Accuracy with No Lag Listing

Now that we know that most models account for the autocorrelation in the data, let’s compare the models for accuracy.

summary(fit22)
## Series: kcts 
## Regression with ARIMA(1,1,1)(0,1,1)[12] errors 
## 
## Coefficients:
##          ar1     ma1    sma1   xreg
##       -0.268  -0.341  -0.816  0.712
## s.e.   0.149   0.137   0.138  0.910
## 
## sigma^2 estimated as 24472982:  log likelihood=-1067
## AIC=2143   AICc=2144   BIC=2157
## 
## Training set error measures:
##                ME RMSE  MAE    MPE  MAPE   MASE     ACF1
## Training set 1156 4583 3450 0.6225 1.989 0.3241 -0.07778
summary(fit23)
## Series: kcts 
## Regression with ARIMA(2,1,1)(2,0,0)[12] errors 
## 
## Coefficients:
##         ar1    ar2     ma1   sar1   sar2  drift    xreg
##       0.136  0.087  -0.675  0.359  0.466  319.3  153.17
## s.e.  0.320  0.187   0.283  0.082  0.093  719.5   72.04
## 
## sigma^2 estimated as 27885348:  log likelihood=-1192
## AIC=2400   AICc=2401   BIC=2422
## 
## Training set error measures:
##                 ME RMSE  MAE    MPE  MAPE MASE     ACF1
## Training set 359.7 5102 4152 0.1199 2.411 0.39 -0.01376
summary(fit24)
## Series: kcts 
## Regression with ARIMA(1,1,1)(2,1,1)[12] errors 
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ma1    sar1    sar2    sma1    xreg
##       -0.047  -0.801  -0.366  -0.053  -0.444  875.60
## s.e.   0.001     NaN     NaN     NaN     NaN   90.44
## 
## sigma^2 estimated as 21861939:  log likelihood=-1057
## AIC=2129   AICc=2130   BIC=2148
## 
## Training set error measures:
##                  ME RMSE  MAE     MPE  MAPE   MASE     ACF1
## Training set -322.9 4290 3208 -0.1865 1.837 0.3013 -0.02273
summary(fit25)
## Series: kcts 
## Regression with ARIMA(2,0,0)(2,0,0)[12] errors 
## 
## Coefficients:
##         ar1    ar2   sar1   sar2  intercept  listing  new_permit
##       0.545  0.401  0.372  0.448     178026    0.509      117.07
## s.e.  0.090  0.091  0.082  0.087      32109    1.004       64.75
## 
## sigma^2 estimated as 29005226:  log likelihood=-1206
## AIC=2427   AICc=2428   BIC=2449
## 
## Training set error measures:
##                 ME RMSE  MAE    MPE  MAPE   MASE     ACF1
## Training set 413.5 5226 4229 0.1249 2.452 0.3973 -0.06411
summary(fit26)
## Series: kcts 
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors 
## 
## Coefficients:
##          ma1    sar1  listing  us_hpi
##       -0.834  -0.573    0.592   981.8
## s.e.   0.057   0.080    0.537   119.0
## 
## sigma^2 estimated as 23104270:  log likelihood=-1060
## AIC=2130   AICc=2130   BIC=2143
## 
## Training set error measures:
##                ME RMSE  MAE     MPE  MAPE   MASE     ACF1
## Training set -512 4453 3376 -0.3191 1.941 0.3171 -0.05139
summary(fit27)
## Series: kcts 
## Regression with ARIMA(1,1,0)(2,0,0)[12] errors 
## 
## Coefficients:
##          ar1   sar1   sar2  new_permit  us_hpi
##       -0.455  0.350  0.439       89.74   805.4
## s.e.   0.089  0.082  0.088       63.01   349.3
## 
## sigma^2 estimated as 28214224:  log likelihood=-1193
## AIC=2397   AICc=2398   BIC=2414
## 
## Training set error measures:
##                  ME RMSE  MAE      MPE  MAPE   MASE    ACF1
## Training set -13.74 5177 4192 -0.07204 2.428 0.3937 -0.1236
summary(fit28)
## Series: kcts 
## Regression with ARIMA(0,0,0)(2,1,0)[12] errors 
## 
## Coefficients:
##         sar1    sar2   drift  xreg.listing  xreg.new_permit  xreg.us_hpi
##       -0.685  -0.273  296.45         0.539            84.08       679.07
## s.e.   0.098   0.110   50.48         0.492            43.24        55.68
## 
## sigma^2 estimated as 22655118:  log likelihood=-1068
## AIC=2149   AICc=2150   BIC=2168
## 
## Training set error measures:
##                  ME RMSE  MAE      MPE  MAPE   MASE   ACF1
## Training set -13.91 4388 3333 -0.04295 1.902 0.3131 0.0973

Looking across all the models, models 1, 3, and 5 have the lowest AICs of the 7 that were created. Of those three, model 3 has the lowest RMSE, but its fit produced NaN standard errors, so I will use model 5, which has the next-lowest RMSE, for the final forecast. That ARIMAX model uses listings and the US HPI to forecast future home selling prices in Kansas City.

Forecast Variables

Now that we know which variables we need, I am going to use the Auto ARIMA method again to produce future values of those variables for the forecast. Again, I know this forecast is not rigorous given the amount of error truly involved in making it, but it is worth trying.

kclisting<-forecast(auto.arima(kcvarts[,2]),h=36)
kcus_hpi<-forecast(auto.arima(kcvarts[,4]),h=36)

kcvarts7<-as.data.frame(cbind(rep(0,36),c(kclisting$mean),rep(0,36),c(kcus_hpi$mean)))

kcvarts7<-ts(kcvarts7, start=c(2017,8),frequency=12)

This time around I created forecasts for just the number of listings and the US HPI.

Forecast with Confidence Intervals

Now that we have forecasted variables we can create the last forecast using the ARIMAX model.

fc26 <- forecast(fit26, xreg=kcvarts7[,c(2,4)] , h=36)

fc26%>%autoplot+
  labs(title = "ARIMAX with Listing & US HPI",subtitle="From 2007 to 2021",x = "Date", y = "Price")+
  theme_minimal() + 
  scale_colour_calc() +
  theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))

This last forecast seems to be another reasonable picture of where average prices are likely to go. But between the ARIMAX model and the TSLM, I prefer the TSLM. I like the LM format a lot: it makes it very easy to understand the effect that seasonality, trend, and the other variables have on the model. And with the forecasted variables, the TSLM produced a much narrower confidence interval, which is good in this case; it suggests that if we knew the true values for new permits and US HPI, we could make a forecast without a lot of variance in the error.